Walmart | Data Engineer - III Interview Experience | 2.5 YoE



Round 1 – SQL, DSA, and PySpark (Technical)

General Introduction

🔹 Discussed my previous tech stack and experiences.

Questions on:

🔹 Data volumes handled

🔹 Type of work done

🔹 Business impact of the work

🔹 SQL Questions

Identify continuous date ranges having the same status

Input:

Expected Output:

Approach: Use ROW_NUMBER() with grouping to find gaps and identify ranges.

Join Output Count

Given 2 tables with NULL values, determine output row count for different joins:

INNER JOIN

LEFT JOIN

RIGHT JOIN

FULL OUTER JOIN

🔹 DSA Coding Question

Group Anagrams

Input: ["eat", "tea", "tan", "ate", "nat", "bat"]

Output: [["bat"], ["tan", "nat"], ["eat", "tea", "ate"]]

Approach: Use a dictionary with sorted strings as keys to group anagrams.

PySpark

🔹 Union of Two Datasets with Different Schemas

🔹 Write a PySpark script to union two datasets having different:

Row counts

Column names

Schemas

Approach: Use column alignment and selectExpr or unionByName with allowMissingColumns=True.

Round 2 – PySpark and SQL

PySpark Tasks

🔹 Word Frequency in Text File

🔹 Read a text file using PySpark.

🔹 Return each word and its frequency.

🔹 Approach: Use flatMap, map, reduceByKey.

🔹 Spark Internals

🔹 Explain Spark architecture.

🔹 Jobs vs Stages vs Tasks.

Given code snippets

🔹 Identify narrow vs wide transformations.

🔹 Determine number of stages.

🔹 Employee Salary and Department Name

Task: Compute yearly salary for each employee and their department name.

SQL Tasks

🔹 Left Anti Join

🔹 Find number of records present in table 1 but not in table 2.

Data Modelling Task

🔹 Given: Customer, Product, and Orders tables.

🔹 Identify dimension vs fact tables.

🔹 Validate schema:

Fact PK should not be FK in a dim table

Round 3 – Spark and SQL

Spark Discussion

🔹 Spark architecture end-to-end

Spark memory management:

🔹 For small clusters, small data, and large data

Hardcoded reserved memory:

🔹 Can it be modified?

🔹 When should it be changed?

Spark 2 vs Spark 3:

🔹 AQE (Adaptive Query Execution)

🔹 Dynamic Partition Pruning

🔹 Pandas UDFs

Cloud Cost Optimization:

🔹 Storage

🔹 Compute

🔹 SQL Scenario

Problem:

From sales and refund tables, find customers who:

Cancelled an order (order_status = 'cancelled')

Did NOT get a refund

Placed another order after cancellation

Schema:

Sales Table:

| customer_id | order_id | order_plcd_ts | order_shipped_dt | order_amt | order_status |

Refund Table:

| order_id | order_refund_dt | refund_amt |

🔹 Git Question

git fetch vs git pull – conceptual and practical differences

Round 4 – Hiring Manager

Discussion Topics

🔹 Detailed discussion on past projects

🔹 CI/CD and pipeline automation

🔹 Airflow setup in AWS (MWAA)

🔹 Repartition vs Coalesce

🔹 When and why to use

🔹 Pros & cons

🔹 Role responsibilities and team expectations


Received the Final Offer